[VL] Add lazy per-column deserialization for Columnar Table Cache#12211

Open
jackylee-ch wants to merge 1 commit into apache:main from jackylee-ch:table-cache-lazy-deserialization

Conversation

@jackylee-ch
Contributor

@jackylee-ch jackylee-ch commented Jun 1, 2026

What changes

This PR makes Velox table cache write V3 per-column framed bytes by default. Lazy materialization is a base table-cache capability; spark.gluten.sql.columnar.tableCache.partitionStats.enabled now only controls the optional stats/pruning payload.

  • Removes spark.gluten.sql.columnar.tableCache.lazy.deserialization.enabled.
  • Adds V3 no-stats serialization (statsLen=0) for the default lazy path.
  • Keeps V3 with stats for partition pruning when partition stats are enabled.
  • Keeps V2 stats and legacy raw bytes as native-capability / backward-read fallback paths.
  • Routes V3 cached bytes through projected native deserialization.
  • Adds JVM/native golden, lazy serde, and GHA benchmark coverage.
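
To illustrate why per-column framing enables projected deserialization, here is a minimal Python sketch. It is not the actual V3 wire format: the magic bytes, field order, and the names `encode_frame` and `decode_projected` are assumptions made up for this example. It only shows the general idea that length-prefixed columns let a reader skip the ones it was not asked for, and that `statsLen=0` can represent a frame without stats.

```python
import struct

MAGIC = b"HYP3"  # hypothetical magic, not Gluten's real V3 magic


def encode_frame(columns, stats=b""):
    """Frame each column separately so a reader can skip columns it does not need."""
    out = bytearray(MAGIC)
    out += struct.pack("<I", len(stats))  # statsLen == 0 means "no stats payload"
    out += stats
    out += struct.pack("<I", len(columns))
    for payload in columns:
        out += struct.pack("<I", len(payload))  # per-column length prefix
        out += payload
    return bytes(out)


def decode_projected(frame, wanted):
    """Return {column index: payload} for the wanted columns only."""
    if frame[:4] != MAGIC:
        raise ValueError("not a hypothetical per-column frame")
    pos = 4
    (stats_len,) = struct.unpack_from("<I", frame, pos)
    pos += 4 + stats_len  # skip the optional stats payload
    (num_cols,) = struct.unpack_from("<I", frame, pos)
    pos += 4
    result = {}
    for idx in range(num_cols):
        (length,) = struct.unpack_from("<I", frame, pos)
        pos += 4
        if idx in wanted:
            result[idx] = frame[pos:pos + length]
        pos += length  # unrequested columns are skipped without being copied
    return result
```

With this layout, a query that touches one of sixteen columns only pays for reading sixteen length prefixes and one payload.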

Performance

Four-environment comparison — eager V2 vs lazy V3, each without and with the optional
partition-stats payload (ColumnarTableCacheLazyDeserBenchmark):

  • V2 without stats = legacy raw Presto payload (eager full-batch decode, no pruning).
  • V2 with stats = framedSerializeWithStats (eager full-batch decode + partition-stats pruning).
  • V3 without stats = per-column lazy payload (default; lazy projected decode).
  • V3 with stats = per-column lazy payload + partition-stats pruning.
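
As a rough illustration of what partition-stats pruning buys on a point lookup, the sketch below keeps a min/max pair per cached batch and only selects batches whose range can contain the lookup key. The `BatchStats` type and `batches_to_decode` helper are hypothetical names for this example, not Gluten classes.

```python
from dataclasses import dataclass


@dataclass
class BatchStats:
    lower: int  # minimum value of the filter column in this cached batch
    upper: int  # maximum value of the filter column in this cached batch


def batches_to_decode(stats, key):
    """Indices of batches whose [lower, upper] range may contain key."""
    return [i for i, s in enumerate(stats) if s.lower <= key <= s.upper]


stats = [BatchStats(0, 99), BatchStats(100, 199), BatchStats(200, 299)]
print(batches_to_decode(stats, 150))  # prints [1]
```

Batches that fail the range check never need to be deserialized, which is independent of whether the surviving batches are then decoded eagerly or lazily.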

100M rows / 32 partitions / 16 columns / 3 iterations, Apple M5 Pro, JDK 8 runtime, real Gluten
(off-heap enabled, ColumnarCachedBatchSerializer). Read phases build one mode's cache at a time so
the full 100M fits. Times are avg ms, lower is better; relative is vs V2 without stats.

Cache footprint (storage memory)

| Mode             | Footprint |
|------------------|-----------|
| V2 without stats | 14542 MiB |
| V2 with stats    | 14558 MiB |
| V3 without stats | 14543 MiB |
| V3 with stats    | 14565 MiB |

Footprint is effectively identical across all four modes (within 0.2%): V3 per-column framing
does not regress cache size for flat data, and the stats payload is negligible.
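
The footprint figures above can be checked with a few lines of arithmetic. The snippet below uses only the numbers from the table and computes each mode's overhead relative to V2 without stats.

```python
footprint_mib = {
    "V2 without stats": 14542,
    "V2 with stats": 14558,
    "V3 without stats": 14543,
    "V3 with stats": 14565,
}

baseline = footprint_mib["V2 without stats"]

# Relative overhead of each mode versus the V2 no-stats baseline, in percent.
overhead_pct = {
    mode: 100.0 * (size - baseline) / baseline
    for mode, size in footprint_mib.items()
}

for mode, pct in overhead_pct.items():
    print(f"{mode}: +{pct:.2f}%")
```

The largest delta is V3 with stats at 23 MiB over the baseline, roughly 0.16%.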

Read latency (avg ms / relative speedup vs V2 no-stats)

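| Phase                             | V2 no-stats | V2 +stats   | V3 no-stats | V3 +stats   |
|-----------------------------------|-------------|-------------|-------------|-------------|
| read 1/16 cols, sum(c0)           | 8217 (1.0x) | 7427 (1.1x) | 1110 (7.4x) | 1050 (7.8x) |
| read 4/16 cols, group+agg         | 9325 (1.0x) | 8569 (1.1x) | 2648 (3.5x) | 2692 (3.5x) |
| filter + 2/16 cols (point lookup) | 8232 (1.0x) | 69.5 (118x) | 1210 (6.8x) | 60.6 (136x) |
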
  • Projected reads: V3 lazy decodes only the requested columns, so it is 7.4x faster reading
    1 of 16 columns and 3.5x faster reading 4 of 16, versus eager V2 which decodes all 16.
  • Filtered point lookup: partition stats prune almost all batches (V2 +stats 118x), and V3
    additionally lazy-decodes only the surviving batches' projected columns, giving the best result at
    136x (V3 with stats). Lazy column-skip alone (V3 no-stats) is 6.8x.
  • All-columns read (decode everything, no skip) was measured separately at smaller scale and is
    on par with / slightly faster than V2 (V3 ~1.3x at 2M), confirming LazyVector adds no overhead
    when every column is materialized. It is omitted from the 100M table because the eager-V2 path
    decodes the full 100M x 16 off-heap and does not fit this 64 GiB laptop.
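
The relative speedups quoted above follow directly from the average latencies. The snippet below recomputes them from the table values; the dictionary keys are shortened phase names chosen for this example.

```python
# Average latency in ms per phase, in column order:
# V2 no-stats, V2 +stats, V3 no-stats, V3 +stats.
latency_ms = {
    "read 1/16 cols": [8217, 7427, 1110, 1050],
    "read 4/16 cols": [9325, 8569, 2648, 2692],
    "filter + 2/16 cols": [8232, 69.5, 1210, 60.6],
}


def speedups(row):
    """Speedup of each mode relative to the first column (V2 no-stats)."""
    return [round(row[0] / value, 1) for value in row]


for phase, row in latency_ms.items():
    print(phase, speedups(row))
```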

Net: V3 lazy per-column is a large win on projected/filtered reads (the common table-cache access
pattern), with an essentially identical cache footprint and no full-scan regression at the 2M-row scale measured.

A GitHub Actions run on a larger-RAM runner can reproduce the same 100M comparison via the
Velox Backend (x86) workflow_dispatch benchmark job.

How was this patch tested?

  • ./dev/format-scala-code.sh
  • PATH="/opt/homebrew/opt/llvm@15/bin:$PATH" ./dev/format-cpp-code.sh
  • git diff --check upstream/main..HEAD
  • ruby -e 'require "yaml"; YAML.load_file(".github/workflows/velox_backend_x86.yml"); puts "yaml ok"'
  • ./.github/workflows/util/check.sh upstream/main
  • env CCACHE_DIR=/private/tmp/gluten-ccache ninja -C cpp/build velox/tests/CMakeFiles/velox_operators_test.dir/VeloxColumnarBatchSerializerTest.cc.o
  • ./build/mvn install -pl backends-velox -am -Pspark-3.5 -Pscala-2.12 -Pbackends-velox -DskipTests -Dexec.skip
  • Local benchmark runability smoke only, not used as PR performance data: Java 8, ColumnarTableCacheLazyDeserBenchmark with 1000 rows, 4 partitions, 1 iteration, phases build,read1,read4,readAll,filter.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex GPT-5

@github-actions github-actions Bot added CORE works for Gluten Core VELOX DOCS labels Jun 1, 2026
@jackylee-ch jackylee-ch marked this pull request as draft June 1, 2026 04:58
@github-actions

github-actions Bot commented Jun 1, 2026

Run Gluten Clickhouse CI on x86

@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 58bd451 to d5a0502 Compare June 1, 2026 08:59
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from d5a0502 to 8e374db Compare June 1, 2026 09:05
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 8e374db to 0f0ccd2 Compare June 1, 2026 09:08
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 0f0ccd2 to 8b09d6b Compare June 1, 2026 11:21
@jackylee-ch jackylee-ch marked this pull request as ready for review June 1, 2026 14:20
@jackylee-ch
Contributor Author

@yaooqinn PTAL

@yaooqinn
Member

yaooqinn commented Jun 2, 2026

Thanks @jackylee-ch, V3 layout is a sensible extension of the cache-stats wire we landed in #12092 / #12196. Several things to discuss before this lands:

1. Benchmark needs to be re-run. The checked-in -results.txt is 10K rows / 4 partitions / 1 iteration on an Apple M5 Pro — Stdev=0 across the board because there's only one sample. Differences in the 1-3 ms range (e.g. "1.1X" at all-16-cols read, where lazy mode physically cannot be faster than eager) are noise. Also build 1.9X is surprising because V3 does N serializeSingleColumn calls vs V2's single-pass batchSerialize — the ordering legacy > V2 > V3 doesn't match the physical work done; this needs reruns on a server / GHA-equivalent runner with iter≥3 and 100M rows / 32 partitions (matching the code defaults). Please also add a cache memory footprint column — V3 per-col framing + getFlattenedRowVector() flattening Dictionary/Constant encodings could regress cache size significantly for dict-encoded payloads, and that's currently unmeasured.

2. Do we really need a new SQLConf? V3 functionally supersedes V2 (V3 frames also carry statsBlob), so this isn't a new behavioral feature — it's a wire-format upgrade. Adding a dedicated lazy.deserialization.enabled boolean commits Gluten to maintaining three cache paths (legacy / V2-stats / V3-lazy-and-stats) and a three-level fallback chain. Once we trust V3, we'd want to deprecate V2-stats, which means another deprecation cycle. Could we either (a) skip the conf and gate V3 behind partitionStats.enabled once it's stable, or (b) turn partitionStats.enabled into a string conf with off | v2 | v3 values? Configuration.md already warns "V3 is NOT backward compatible with V2 readers" + default=false — operationally nobody is going to flip this, so the conf risks being long-lived dead code.
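
As a sketch of option (b) only, a string-valued conf could be parsed as follows. The `StatsMode` enum and `parse_stats_mode` function are hypothetical names for illustration; Gluten's actual conf handling is in Scala and is not shown here.

```python
from enum import Enum


class StatsMode(Enum):
    OFF = "off"
    V2 = "v2"
    V3 = "v3"


def parse_stats_mode(raw):
    """Parse a hypothetical off | v2 | v3 conf value, ignoring case and whitespace."""
    value = raw.strip().lower()
    try:
        return StatsMode(value)
    except ValueError:
        allowed = " | ".join(mode.value for mode in StatsMode)
        raise ValueError(
            f"invalid stats mode {raw!r}, expected one of: {allowed}"
        ) from None
```

A single enum-valued conf keeps the wire-format choice in one place, at the cost of rejecting the old boolean values unless a migration mapping is added.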

3. Cross-language test parity vs #12196. V3 has no cpp-side byte-equal golden test; JVM-side tests synthesize their own frames via craftV3Framed. We just established the cpp-golden ↔ JVM-parser round-trip pattern in #12196 specifically because layout drift between halves is a correctness hazard. V3 needs the same: a framedSerializeWithStatsV3Golden cpp test pinning a byte-stable literal + a JVM parser round-trip over that same literal.
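
The golden-literal pattern can be sketched in a few lines. In the real setup the writer is cpp and the parser is JVM; here both halves are Python and the toy format (a little-endian count followed by int64 values) is an assumption for illustration, unrelated to the V3 layout.

```python
import struct

# Byte-stable literal pinned in the test; both halves must agree with it.
GOLDEN_HEX = "020000000100000000000000ffffffffffffffff"


def serialize(values):
    """Writer half: uint32 count, then each value as a little-endian int64."""
    return struct.pack("<I", len(values)) + b"".join(
        struct.pack("<q", v) for v in values
    )


def parse(data):
    """Reader half: decode the same layout back into a list of ints."""
    (count,) = struct.unpack_from("<I", data, 0)
    return list(struct.unpack_from(f"<{count}q", data, 4))


golden = bytes.fromhex(GOLDEN_HEX)
assert serialize([1, -1]) == golden  # writer is pinned to the literal
assert parse(golden) == [1, -1]      # reader round-trips the same literal
```

Because both sides are checked against the same literal rather than against each other, a layout change in either half fails its own test.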

4. Smaller items.

  • All-null column case not covered (we hit the PrestoSerde uninit-values bug during #12092 development; same risk class for the per-col path).
  • getFlattenedRowVector() side effect on Dictionary/Constant encoding not documented.
  • The // JNI pin outlives comment in deserializeV3 describes a non-issue (copies are made synchronously in step 6, the lazy loader doesn't depend on the pin) — please trim.
  • Two near-identical magic checks (parseFramedBytes byte[3] dispatch vs isV3Format 4-byte compare) — please consolidate.
  • Consider folding statsExtV3AvailableFlag and statsExtAvailableFlag into a single capability enum (Unknown | V2 | V3 | Unavailable) — two independent one-shot latches double the operational diagnosis surface.
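
The single-capability idea from the last item could look roughly like the following. This is a sketch under assumptions: `StatsCapability`, `CapabilityLatch`, and the probe callables are hypothetical, and the class is not thread-safe as written.

```python
from enum import Enum


class StatsCapability(Enum):
    UNKNOWN = "unknown"
    V2 = "v2"
    V3 = "v3"
    UNAVAILABLE = "unavailable"


class CapabilityLatch:
    """One-shot latch: probes run once, in order, and the result is cached."""

    def __init__(self, probes):
        # probes: ordered list of (capability, zero-argument callable returning bool)
        self._probes = probes
        self._state = StatsCapability.UNKNOWN

    def resolve(self):
        if self._state is StatsCapability.UNKNOWN:
            self._state = StatsCapability.UNAVAILABLE
            for capability, probe in self._probes:
                if probe():
                    self._state = capability
                    break
        return self._state
```

Listing the V3 probe before the V2 probe gives the same preference order as a fallback chain, while leaving a single state value to inspect when diagnosing which path is active.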

Happy to file any of these as separate issues if it helps.

@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 8b09d6b to 09679ee Compare June 2, 2026 06:24
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 09679ee to ab9e0f7 Compare June 2, 2026 06:30
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from ab9e0f7 to 144e816 Compare June 2, 2026 06:47
@github-actions github-actions Bot removed the CORE works for Gluten Core label Jun 2, 2026
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch 2 times, most recently from b77f4ab to 9a0f96a Compare June 2, 2026 07:28
@github-actions github-actions Bot removed the DOCS label Jun 2, 2026
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 9a0f96a to b5b1906 Compare June 2, 2026 09:01
@github-actions github-actions Bot added the INFRA label Jun 2, 2026
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch 3 times, most recently from 2b96545 to c3cc1bd Compare June 2, 2026 15:28
@github-actions github-actions Bot added the CORE works for Gluten Core label Jun 2, 2026
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from c3cc1bd to 97a6019 Compare June 3, 2026 03:42
@github-actions github-actions Bot added the DOCS label Jun 3, 2026
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 97a6019 to 9971c91 Compare June 3, 2026 03:52
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 9971c91 to f576df8 Compare June 3, 2026 06:33
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from f576df8 to f17dc6a Compare June 3, 2026 06:51
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from f17dc6a to cda20eb Compare June 3, 2026 09:27
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch 3 times, most recently from decdd0e to ab055c5 Compare June 3, 2026 14:16
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from ab055c5 to 2538fe5 Compare June 3, 2026 18:55
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 2538fe5 to 765794f Compare June 4, 2026 04:35

Write V3 per-column cache bytes by default for Velox table cache. Partition stats now only controls the optional stats/pruning payload: stats off writes a no-stats V3 frame, stats on writes V3 with stats, and older native libraries still fall back to V2 stats or legacy bytes.

Add the V3 no-stats JNI/native serializer, JVM parsing for statsLen=0, cross-language golden coverage, and GitHub Actions benchmark execution without committing local benchmark results.

Change-Id: I2a8582f901fafd436cac1a1d16e0367e9330b336
@jackylee-ch jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 765794f to c7f9e2f Compare June 4, 2026 07:54
@github-actions github-actions Bot removed the INFRA label Jun 4, 2026
Labels

CORE works for Gluten Core DOCS VELOX

2 participants